Modeling ETL Data Quality Enforcement Tasks Using Relational Algebra Operators
Authors
Abstract
Usually, a data warehouse is refreshed periodically with data gathered from disparate source systems. Nevertheless, this data might not be fully accurate and may contain serious data quality problems, such as uniqueness violations, misrepresentations, null values, multi-purpose fields, or inconsistent values, for one or more attributes. This contributes substantially to users' falling expectations of the data they analyze from data warehouses. Data quality enforcement is a complex, time-consuming task that parses data from source tables, corrects it, normalizes it, and integrates it into a data warehouse for a better representation of real businesses. In this paper, we analyze some of the common tasks associated with data quality enforcement, representing and modeling them using Relational Algebra as a specification tool. © 2013 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of CENTERIS/HCIST.
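As an illustration only, and not the paper's own notation or implementation, the sketch below shows how two of the problems named in the abstract (null values and uniqueness violations) might be expressed with the relational algebra operators selection and projection. The table, its attribute names, and the helper functions `select` and `project` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Selection (sigma): keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows: Iterable[Row], attrs: Tuple[str, ...]) -> List[Row]:
    """Projection (pi) with set semantics: keep attrs, drop duplicate tuples."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result


# Hypothetical source table with one null value and one duplicated row.
source = [
    {"id": "1", "name": "Ana", "city": "Braga"},
    {"id": "2", "name": None, "city": "Porto"},
    {"id": "1", "name": "Ana", "city": "Braga"},
]

# Rows rejected by the null check: sigma_{name IS NULL}(source)
rejected = select(source, lambda r: r["name"] is None)

# Clean rows: pi_{id, name, city}(sigma_{name IS NOT NULL}(source))
clean = project(
    select(source, lambda r: r["name"] is not None),
    ("id", "name", "city"),
)
```

Routing the rejected rows to a separate relation, rather than discarding them, is one possible design choice; it keeps the failing records available for later correction.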
Similar resources
Extending Relational Algebra to express one-to-many data transformations
Application scenarios such as legacy-data migration, ETL processes, data cleaning and data-integration require the transformation of input tuples into output tuples. Traditional approaches for implementing these data transformations enclose solutions as Persistent Stored Modules (PSM) executed by an RDBMS or transformation code using a commercial ETL tool. Neither of these solutions is easily m...
Extending the Relational Algebra with the Mapper Operator
Application scenarios such as legacy data migration, Extract-Transform-Load (ETL) processes, and data cleaning require the transformation of input tuples into output tuples. Traditional approaches for implementing these data transformations enclose solutions as Persistent Stored Modules (PSM) executed by an RDBMS or transformation code using a commercial ETL tool. Neither of these is easily main...
Hybrid: A Large-scale In-memory Image Analytics Engine
Analytical image/video processing tasks such as scene/face/activity recognition are historically performed outside most relational database management systems. Relational engines are optimized for relational data, hence naturally have weaker support for non-relational data such as images or video. Hybrid, a high-velocity in-memory analytics engine, supports advanced access capabilities to both ...
From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra
MapReduce-based data processing platforms offer a promising approach for cost-effective and Web-scale processing of Semantic Web data. However, one major challenge is that this computational paradigm leads to high I/O and communication costs when processing tasks with several join operations typical in SPARQL queries. The goal of this demonstration is to show how a system RAPID+, an extension o...
An ETL Metadata Model for Data Warehousing
Metadata is essential for understanding information stored in data warehouses. It helps increase levels of adoption and usage of data warehouse data by knowledge workers and decision makers. A metadata model is important to the implementation of a data warehouse; the lack of a metadata model can lead to quality concerns about the data warehouse. A highly successful data warehouse implementation...